Project Overview & Introduction

Authors

Jorge Bris Moreno

Keqin “Gale” Liu

Jude Moukarzel

Irene Tait

A comparison of aggregate polls, election results, and Reddit data to try to better contextualize the 2022 midterm elections.

Project Goal

The aim of this project is to use the 2022 Reddit data provided by our professors to generate an unofficial “thermometer” of Reddit concerning US political leanings. We then compare that thermometer to aggregate polling data on the 2022 midterms to answer our central, somewhat branching questions:

  • How closely does the unofficial Reddit “polling score” align with traditional polling data?

  • If it differs, is it more or less accurate at predicting the actual observed results (i.e., the election outcome)?

  • If it does not differ, does it appear that Reddit users (or those similar to them) are being reflected in the polls themselves, or is there an effect of a poll’s release on altering sentiment on Reddit around the election? In other words, does the poll perpetuate itself? Or are the Reddit users part of the noise potentially skewing polling results?

Abstract

Historically, the party that controls the US Presidency tends to perform poorly in the midterms, as people express frustration with those currently in power by voting for the other guy. With the White House controlled by Democrats in 2022, conventional wisdom held that the USA would likely see a so-called “Red Wave,” where significant numbers of Democrats would lose seats in both state and federal systems to Republican challengers. Polling data throughout the election cycle overall seemed in line with this prediction. However, come election day, the Red Wave did not manifest: the Republicans’ net gain was only nine seats in the House, and the Democrats retained the Senate. The Democrats flipped three governor’s seats, maintained their hold on all previously-held state legislative chambers, and even flipped several state legislative chambers, as in Michigan and Minnesota[1]. What happened to the Red Wave that polling had predicted?

For our final project, Team 11 used the corpus of Reddit data to see if Reddit users had a different view than traditional polls ahead of the 2022 midterm election on November 8th. We used sentiment analysis on comments and posts containing keywords relating to the election, in order to create an unofficial Reddit “polling score” showing how the site’s users were feeling in a specific snapshot of time, month-to-month. We then compared that unofficial score to an aggregation of the traditional polling data conducted over the same time period, and tried to identify differences & similarities between the two, especially around election-significant events.

Background

Before Americans went to the polls for the midterm elections in November of 2022, the consensus in the media atmosphere seemed to expect a surge of Republican voters: the so-called “Red Wave” that some headlines were predicting[2] even right up to the election. FiveThirtyEight (an opinion poll aggregator founded by statistical wunderkind Nate Silver and now owned by ABC News) seemed to agree with this trend. Below is a screengrab of their 2022 overall poll aggregator:

FiveThirtyEight showed an intriguing narrative: the Democrats were more popular until the start of the year, at which point Republicans seemed to be the favorite until mid-summer. After the Democrats again pulled ahead in this aggregate poll around early August, a surge in polls predicting Republican victory occurred mid-October, just pushing Republicans ahead in popularity in the few weeks before the election[3].

Prior to 2022, it had been a long while since any sitting President’s party had maintained its hold on the Senate while simultaneously losing no state legislative houses: the last time this had happened was 1934, under one Franklin Delano Roosevelt. 1934 was also the last election where the Democrats gained governors during a midterm while holding the Presidency (the Republicans last did that in 1986)[4].

When the dust had settled after all ballots were counted, the Republicans had indeed taken back the House, but by a meager nine seats (in the last 25 years, the opposition has on average gained about 30 seats in the House of Representatives during the midterms)[4]. Not only did they not recapture the Senate, they lost one seat, cementing the Democratic majority in the upper house. Despite flipping Nevada’s gubernatorial seat, Republicans had lost the governorships of Arizona, Maryland, and Massachusetts, resulting in a net gain for the Democrats, another unusual result. At the state level, Democrats had flipped both chambers in Michigan (handing them a governmental trifecta for the next two years[5]), and one chamber each in the states of Minnesota, Pennsylvania, and Alaska (although Alaska required a coalition to gain the majority).

The media response the next day was, perhaps, predictable:

In the days that followed, some headlines blamed skewed polls for misleading the media[6]. Others argued that the media misread what the polls had said, and the polls had been right all along[7]. More than a few news outlets blamed misinterpretation of “vibes” as why this was a surprising result[8]. But vibes are hard to quantify, so we thought to look to something a little bit more concrete.

FiveThirtyEight’s exact weighting metrics to create the graph you saw above are, unfortunately, proprietary (although they do give some hints in their documentation as to their thought-process). Thankfully, they do provide all their raw polling aggregation data to us mere mortals not employed by ABC, for which we are grateful. By using these aggregate polls to create a month-by-month snapshot of polling across the electorate, we can then compare this to how Redditors felt during that same time period, and try to tease out missing pieces from the midterm narrative that may have hidden the lack of a Red Wave.

The Data Sources

Our primary data source was the Reddit corpus, used to create our partisan-labelled posts, which in turn created our unofficial Reddit poll. This was compared to aggregated polling results, downloaded from FiveThirtyEight’s library of historical polls. To be comparable to our Reddit thermometer, we specifically chose the generic ballot polls: polls that asked respondents which party they would vote for in any election and were not specific to one race or another.
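As a rough illustration of a month-by-month polling snapshot, the sketch below averages the Democratic-minus-Republican margin of generic ballot polls per calendar month. The field names (`end_date`, `dem`, `rep`), the unweighted mean, and the sample rows are assumptions made for illustration. The numbers are made up, and this is not necessarily the aggregation used in this project.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_margin(polls):
    """Average (dem - rep) margin per (year, month), unweighted."""
    by_month = defaultdict(list)
    for poll in polls:
        key = (poll["end_date"].year, poll["end_date"].month)
        by_month[key].append(poll["dem"] - poll["rep"])
    # Sort by (year, month) so the result reads chronologically.
    return {month: mean(margins) for month, margins in sorted(by_month.items())}


# Made-up illustrative rows, NOT real polling numbers.
sample_polls = [
    {"end_date": date(2022, 9, 10), "dem": 45.0, "rep": 44.0},
    {"end_date": date(2022, 9, 25), "dem": 46.0, "rep": 43.0},
    {"end_date": date(2022, 10, 20), "dem": 44.0, "rep": 46.0},
]

margins = monthly_margin(sample_polls)
for (year, month), margin in margins.items():
    print(f"{year}-{month:02d}: D{margin:+.1f}")
```

A positive margin means Democrats lead in that month's average and a negative one means Republicans lead. A weighted mean (for example by sample size) could be swapped in for `mean` without changing the rest of the sketch.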

Steps Followed

  1. Data Collection: Read all the Reddit data available to us, and gathered polling data.
  2. Cleaning: Removed comments & submissions from subreddits that would skew the data (like r/Democrats or r/Republicans).
  3. Selection: Kept only the posts within our time frame (January 2022 to November 2022).
  4. Keywords: Scanned the data set for the political keywords we are interested in.
  5. NLP: Performed sentiment analysis on posts (comments & submissions) containing our keywords.
  6. Labeling: Used the relationship between the sentiment and the keywords to label the post as “Democrat”, “Republican”, or “No Party Preference”.
  7. Machine Learning: Created ML models that could potentially be used to replicate our study by automatically labeling the posts without the need of the steps above.
  8. Results: Compared the sentiment of the labelled posts with the poll aggregation.
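The Cleaning, Selection, and Keywords steps above can be sketched in plain Python as a single filter. This is a minimal sketch under stated assumptions, not the project's actual pipeline: the keyword lists are hypothetical placeholders, the record fields (`subreddit`, `created_utc`, `text`) are assumed, and treating the time frame as including all of November 2022 is also an assumption.

```python
import re
from datetime import datetime, timezone

# Partisan subreddits named in the Cleaning step, lowercased for comparison.
EXCLUDED_SUBREDDITS = {"democrats", "republicans"}

# Assumed window: 1 January 2022 up to (not including) 1 December 2022.
WINDOW_START = datetime(2022, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2022, 12, 1, tzinfo=timezone.utc)

# Hypothetical keyword lists; the project's real keywords are not shown here.
KEYWORDS = {
    "Democrat": ["democrat", "biden"],
    "Republican": ["republican", "gop"],
}
PATTERNS = {
    party: re.compile(r"\b(?:" + "|".join(words) + r")s?\b", re.IGNORECASE)
    for party, words in KEYWORDS.items()
}


def matched_parties(post):
    """Return the parties whose keywords appear in a post that passes the filters."""
    if post["subreddit"].lower() in EXCLUDED_SUBREDDITS:
        return set()
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    if not (WINDOW_START <= created < WINDOW_END):
        return set()
    return {party for party, pattern in PATTERNS.items() if pattern.search(post["text"])}


def _ts(year, month, day):
    """Helper: UTC epoch seconds for a calendar date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Made-up example posts.
sample_posts = [
    {"subreddit": "politics", "created_utc": _ts(2022, 10, 1), "text": "Republicans will take the House."},
    {"subreddit": "Democrats", "created_utc": _ts(2022, 10, 1), "text": "Vote for the Democrats!"},
    {"subreddit": "news", "created_utc": _ts(2021, 6, 1), "text": "Biden signs a bill."},
]
results = [matched_parties(p) for p in sample_posts]
```

Only posts that return a non-empty set would move on to sentiment analysis. On a corpus the size of Reddit's, the same logic would more likely be expressed as distributed DataFrame filters than as a Python loop.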

Assumptions

  • We are assuming that exposure to information influences people’s opinions, so we analyzed by posts (both comments and original submissions) and did not aggregate by user. This created an overall “thermometer” of how Reddit as a body was feeling. This could be changed in future iterations of the project.
  • We did not account for significant third-party preferences or spoiler effects. Duverger’s law states that in political systems with the same characteristics as the United States, only two parties tend to emerge and the rest are marginalized, so we left third parties out of the analysis.
  • We infer that if a post contains negative sentiment towards a party or a belief held by the party platform, it is indicative of support for the opposition. This is a naive assumption, but we treat it as a workable approximation.
  • We are relying on the accuracy of the pretrained model “sentimentdl_use_twitter” from John Snow Labs’ Spark NLP library. We assume that the model is accurate and that the data we are analyzing is similar to the data the model was trained on.
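The inference described in the assumptions above, that negative sentiment towards one party signals support for the other, can be written as a small labeling rule. This is a hedged sketch: the sentiment strings (`"positive"`, `"negative"`, `"neutral"`) are assumed, as is the choice to return “No Party Preference” for posts that mention both parties or neither, which the text above does not specify.

```python
OPPOSITE = {"Democrat": "Republican", "Republican": "Democrat"}
NO_PREFERENCE = "No Party Preference"


def label_post(parties_mentioned, sentiment):
    """Map (parties mentioned, sentiment) to a partisan label.

    parties_mentioned: set of party names whose keywords appear in the post.
    sentiment: assumed to be "positive", "negative", or "neutral".
    """
    # Assumption: a post mentioning both parties (or none) is ambiguous.
    if len(parties_mentioned) != 1:
        return NO_PREFERENCE
    (party,) = parties_mentioned
    if sentiment == "positive":
        return party
    if sentiment == "negative":
        # Naive assumption: dislike of one party implies support for the other.
        return OPPOSITE[party]
    return NO_PREFERENCE


print(label_post({"Republican"}, "negative"))
print(label_post({"Democrat"}, "positive"))
print(label_post({"Democrat", "Republican"}, "negative"))
```

Writing the rule out shows where it is most fragile: sarcasm, quoted opponents, and posts criticizing both parties all end up either mislabeled or in the “No Party Preference” bucket.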

Footnotes

  1. “2022 Election Results.” Politico. Accessed Dec 8 2024.

  2. Enten, Harry. “A Republican wave in the House is still quite possible.” CNN, Oct 16 2022. Accessed Dec 8 2024.

  3. “Latest Polls - Generic Ballot 2022.” FiveThirtyEight. Accessed Dec 2 2024.

  4. Blake, Aaron. “How bad the 2022 election was for the GOP, historically speaking.” The Washington Post, Nov 10 2022. Accessed Dec 8 2024.

  5. Perkins, Tom. “How Michigan Democrats took control for the first time in decades.” The Guardian, Nov 17 2022. Accessed Dec 8 2024.

  6. Rutenberg, Jim; Bensinger, Ken; and Eder, Steve. “The ‘Red Wave’ Washout: How Skewed Polls Fed a False Election Narrative.” The New York Times, Dec 31 2022. Accessed Dec 4 2024.

  7. Narea, Nicole. “The guy who got the midterms right explains what the media got wrong.” Vox Media, Nov 27 2022. Accessed Dec 8 2024.

  8. Thompson, Derek. “Democrats Might Have Pulled Off the Biggest Midterm Shock in Decades.” The Atlantic, Nov 9 2022. Accessed Dec 8 2024.